Log-linear Generation Models for Example-based Machine Translation

Authors

  • Zhanyi Liu
  • Haifeng Wang
  • Hua Wu
  • Dong Cheng
Abstract

This paper describes log-linear generation models for Example-based Machine Translation (EBMT). In the generation model, various knowledge sources are described as feature functions and incorporated into the log-linear models. Six features are used in this paper: matching score and context similarity, to estimate the similarity between the input sentence and the translation example; word translation probability and target language string selection probability, to estimate the reliability of the translation example; and language model probability and length selection probability, to estimate the quality of the generated translation. To evaluate the performance of the log-linear generation models, we build an English-to-Chinese EBMT system with the proposed generation model. Experimental results show that our EBMT system significantly outperforms both a baseline EBMT system and a phrase-based SMT system.

Introduction

In Example-based Machine Translation (EBMT), translation generation plays a crucial role (Somers, 1999; Hutchins, 2005). For EBMT systems, there are two major approaches to selecting translation fragments and generating the final translation. Semantic-based approaches obtain an appropriate target language fragment for each part of the input sentence by means of a thesaurus; the final translation is generated by recombining the target language fragments in a predefined order (Aramaki et al., 2003; Aramaki & Kurohashi, 2004). This approach does not take the transitions between fragments into account, so the fluency of the translation is weak. Statistical approaches select translation fragments with a statistical model (Knight & Hatzivassiloglou, 1995; Kaki et al., 1999; Callison-Burch & Flournoy, 2001; Akiba et al., 2002; Hearne & Way, 2003, 2006; Imamura et al., 2004; Badia et al., 2005; Carl et al., 2005). The statistical model can solve the transition problem by using n-gram co-occurrence statistics.
However, this approach does not take the semantic relations between the translation example and the input sentence into account. As a result, the accuracy of the translation is poor. Liu et al. (2006) presented a hybrid generation model which combines these two approaches.

In this paper, we propose log-linear generation models for EBMT. Unlike the hybrid model presented in Liu et al. (2006), our generation model uses various knowledge sources that are described as feature functions. The feature functions are incorporated into log-linear models (Och & Ney, 2002, 2004). We use six feature functions. Matching score and context similarity are used to estimate the similarity between the input sentence and the source part of the translation example. Word translation probability and target language string selection probability are used to estimate the reliability of the translation example. Language model probability and length selection probability are used to estimate the quality of the generated translation. Experimental results show that the performance of the EBMT system is significantly improved by using the log-linear generation models. Such an EBMT system also achieves a significant improvement of 0.0378 BLEU score (17.2% relative) as compared with a phrase-based SMT system.

The remainder of the paper is organized as follows. The next section briefly introduces the Tree String Correspondence based EBMT method. We then describe the log-linear generation models and the feature functions. After that, the search algorithm is described. Finally, we present the experimental results and conclude the paper.

Tree String Correspondence Based EBMT

In this paper, we improve the Tree String Correspondence (TSC) based EBMT method (Liu et al., 2006) with the log-linear generation models.

Definition of TSC

Given a phrase-structure tree T and a subtree Ts of T, Ts is a matching-tree of T if Ts satisfies the following conditions:

1. There is more than one node in Ts.
2. In Ts, there is only one node r (the root node of Ts) whose parent node is not in Ts. All the other nodes in Ts are descendant nodes of r.
3. For any node n in Ts except r, the sibling nodes of n are also in Ts.

Here, each node of the parse tree is labeled with its headword and category. A TSC is defined as a triple <t, s, c>, where t is a matching-tree of the source language parse tree; s is a target language string corresponding to t; and c denotes the word correspondence, which consists of the links between the leaf nodes of t and the substrings of s.

If a leaf node of the matching-tree in a TSC is a non-terminal node of the parse tree, then this leaf node is also called a substitution node. The part of the target language string corresponding to a substitution node is called a substitution symbol. A substitution symbol can represent a single word, or a phrase that can be expanded by another matching-tree. During translation, for each substitution node, its corresponding substitution symbol will be replaced by the translation candidate of the TSC whose root node corresponds to this substitution node.

A TSC is used to represent either a static translation example or a dynamic translation example fragment. In the TSC-based EBMT system, a preprocessed translation example is statically stored as a TSC in the example database. During translation, a translation example fragment that is identified to match the input is represented as a TSC.

In this paper, we use English-to-Chinese MT as a case study. Figure 1 (Examples of TSC) shows three examples of English-to-Chinese TSCs. TSC (a) indicates the following translation example:

Mary borrowed a book from her friend.
玛丽 从 她 朋友 那里 借 了 一 本 书 。
(Mary from her friend there borrow a book .)

In this TSC, the matching-tree of the source language and the target language string are composed of the source part and the target part of the translation example, respectively.
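The triple definition above can be sketched as a small data structure with a checker for the three matching-tree conditions. This is our own illustration under assumed class and field names, not code from the paper:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)  # eq=False keeps identity-based hashing for set membership
class Node:
    """Parse-tree node labeled with its headword and category."""
    headword: str
    category: str
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

def add_child(parent: Node, child: Node) -> None:
    child.parent = parent
    parent.children.append(child)

def is_matching_tree(nodes: set) -> bool:
    """Check the three matching-tree conditions from the definition."""
    # 1. There is more than one node.
    if len(nodes) <= 1:
        return False
    # 2. Exactly one node r whose parent is outside the set;
    #    every other node must be a descendant of r.
    roots = [n for n in nodes if n.parent not in nodes]
    if len(roots) != 1:
        return False
    r = roots[0]
    def descends_from_r(n: Node) -> bool:
        while n.parent is not None:
            n = n.parent
            if n is r:
                return True
        return False
    if not all(descends_from_r(n) for n in nodes if n is not r):
        return False
    # 3. Every non-root node brings all of its siblings along.
    return all(sib in nodes
               for n in nodes if n is not r
               for sib in n.parent.children)
```

A TSC itself would then pair such a matching-tree with its target-language string s and the leaf-to-substring correspondence c.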
TSC (b) and (c) are derived from TSC (a). The matching-trees of TSC (b) and (c) match the subtree of TSC (a) that is rooted at Node 4. The matching-tree of TSC (b) matches all descendant nodes of Node 4, and no substitution nodes are included in the matching-tree; the target language string corresponding to the subtree is considered the translation of the matching-tree. Unlike in TSC (b), the leaf nodes 6 and 11 in the matching-tree of TSC (c) are non-terminal nodes of the matching-tree of TSC (a). These two nodes are substitution nodes, and their corresponding parts in the target language string are substitution symbols. Thus, the target language string of TSC (c) consists of target language words and substitution symbols.

Two TSCs are homologous if their source language matching-trees are the same, which means that the same source language matching-tree can be translated into different target language strings. A TSC forest matches a parse tree if the source language matching-trees of the TSC forest exactly compose the parse tree. For TSCs T1 and T2 in a TSC forest, if the root node of T1 corresponds to a substitution node of T2, then T1 is the child TSC of T2 and T2 is the parent TSC of T1.

EBMT Based on TSC

In the EBMT system based on TSC, translation examples are represented as TSCs. An input sentence to be translated is first parsed into a tree. Then the TSC forest that best matches the input tree is searched out. Finally, the translation is generated by combining the target language strings of the TSCs. For the parse tree of the input sentence, there are many TSC forests that match the parse tree. In Liu et al. (2006), a TSC better matches a parse tree if the TSC has more nodes or a higher matching score with the parse tree. Thus, a TSC forest best matches a parse tree if it has the highest matching score among the candidate TSC forests.
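The log-linear generation model described in the introduction combines the six feature functions as a weighted sum of log feature values, and the candidate with the highest score wins. A minimal sketch, with feature names and weights that are illustrative placeholders of our own rather than the paper's tuned values:

```python
import math

# Illustrative weights (lambda_i) for the six features; in a real
# system these would be tuned on held-out data.
WEIGHTS = {
    "matching_score": 1.0,
    "context_similarity": 0.5,
    "word_translation_prob": 1.0,
    "target_string_selection_prob": 0.8,
    "language_model_prob": 1.2,
    "length_selection_prob": 0.3,
}

def log_linear_score(features: dict) -> float:
    """Score a candidate as sum_i lambda_i * log h_i, the standard
    log-linear combination (Och & Ney, 2002)."""
    return sum(WEIGHTS[name] * math.log(value)
               for name, value in features.items())

def best_candidate(candidates: list) -> dict:
    """Pick the candidate translation with the highest score."""
    return max(candidates, key=lambda c: log_linear_score(c["features"]))
```

Because the log is monotone, maximizing this sum is equivalent to maximizing the weighted product of the feature values themselves.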
A greedy tree-matching algorithm was used to search for the TSC forest that best matches the parse tree of the input sentence.
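The details of the greedy search are cut off in this excerpt. The following is only a plausible top-down greedy loop under simplified interfaces of our own (hypothetical `lookup_tscs` and `match_score` callables), not the paper's algorithm:

```python
def greedy_match(node, lookup_tscs, match_score):
    """Greedily cover a parse tree with TSCs, top down: at each node,
    take the best-scoring TSC rooted there and recurse into its
    substitution nodes.

    lookup_tscs(node) -> candidate TSCs whose matching-tree can be
        rooted at `node` (hypothetical example-database interface).
    match_score(tsc)  -> ranking function (e.g. more nodes / higher
        matching score is better).

    Returns the chosen TSCs as a list (the TSC forest).
    """
    candidates = lookup_tscs(node)
    if not candidates:
        return []  # no translation example covers this node
    best = max(candidates, key=match_score)
    forest = [best]
    for sub in best.substitution_nodes:
        forest.extend(greedy_match(sub, lookup_tscs, match_score))
    return forest
```

A greedy choice at each node is fast but locally optimal; the log-linear model above would instead rescore competing forests globally.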


Similar Papers

Minimum error training of log-linear translation models

Recent work on training of log-linear interpolation models for statistical machine translation reported performance improvements by optimizing parameters with respect to translation quality, rather than to likelihood oriented criteria. This work presents an alternative and more direct training procedure for log-linear interpolation models. In addition, we point out the subtle interaction betwee...


Building a Large Machine-Aligned Parallel Treebank

This paper reports on-going work on building a large automatically tree-aligned parallel treebank in the context of a syntax-based machine translation (MT) approach. For this we develop a discriminative tree aligner based on a log-linear model with a rich feature set. We incorporate various language-independent and language-specific features taking advantage of existing tools and annotation. Our ...


A finite-state framework for log-linear models in machine translation

Log-linear models represent nowadays the state-of-the-art in statistical machine translation. There, several models are combined altogether into a whole statistical approach. Finite-state transducers constitute a special type of statistical translation model whose interest has been proved in different translation tasks. The goal of this work is to introduce a finite-state framework for a log-li...


Vector Space Models for Phrase-based Machine Translation

This paper investigates the application of vector space models (VSMs) to the standard phrase-based machine translation pipeline. VSMs are models based on continuous word representations embedded in a vector space. We exploit word vectors to augment the phrase table with new inferred phrase pairs. This helps reduce out-of-vocabulary (OOV) words. In addition, we present a simple way to learn bili...


Function Word Generation in Statistical Machine Translation Systems∗

Function words play an important role in sentence structures and express grammatical relationships with other words. Most statistical machine translation (SMT) systems do not pay enough attention to translations of function words which are noisy due to data sparseness and word alignment errors. In this paper, a novel method is designed to separate the generation of target function words from ta...




Publication date: 2007